What is Holding Back Convnets for Detection?

Abstract Convolutional neural networks have recently shown excellent
results in general object detection and many other tasks. Albeit very
effective, they involve many user-defined design choices. In this paper we
want to better understand these choices by inspecting two key aspects:
“what did the network learn?”, and “what can the network learn?”. We
exploit new annotations (Pascal3D+) to enable a new empirical analysis
of the R-CNN detector. Despite common belief, our results indicate that
existing state-of-the-art convnet architectures are not invariant to various
appearance factors. In fact, all considered networks have similar weak
points which cannot be mitigated by simply increasing the training data
(architectural changes are needed). We show that overall performance
can improve when using image renderings for data augmentation. We
report the best known results on the Pascal3D+ detection and viewpoint
estimation tasks.


1 Introduction 

In recent years convolutional neural networks (convnets) have become “the
hammer that pounds many nails” of computer vision. Classical problems such as
general image classification, object detection, pose estimation, face
recognition, object tracking [20], keypoint matching, stereo matching [42],
optical flow, boundary estimation, and semantic labelling now all have
top-performing results based on a direct usage of convnets. The price
to pay for such versatility and good results is a limited understanding of why
convnets work so well, and how to build and train them to reach better results.

In this paper we focus on convnets for object detection. For many object
categories convnets have almost doubled the previous detection quality. Yet, it
is unclear what exactly enables such good performance, and critically, how to
further improve it. The usual word of wisdom for better detection with convnets
is “larger networks and more data”. But: how should the network grow; which
kind of additional data will be most helpful; what follows after fine-tuning an
ImageNet pre-trained model on the classes of interest? We aim at addressing
such questions in the context of the R-CNN detection pipeline (§2).

Previous work aiming to analyse convnets has either focused on theoretical
aspects, visualising some specific patterns emerging inside the network
[18, 31, 33, 22], or doing ablation studies of working systems. However, it
remains unclear what is holding back the detection capabilities of convnets.

Contributions This paper contributes a novel empirical exploration of R-CNNs
for detection. We use the recently available Pascal3D+ [39] dataset, as well as
rendered images to analyze R-CNN capabilities at a more detailed level than
previous work. In a new set of experiments we explore which appearance factors
are well captured by a trained R-CNN, and which ones are not. We consider
factors such as rotation (azimuth, elevation), size, category, and instance shape.
We want to know which aspects can be improved by simply increasing the
training data, and which ones require changing the network. We want to answer
both “what did the network learn?” and “what can the network learn?” (§5 and
§6). Our results indicate that current convnets (AlexNet [17], GoogleNet [35],
VGG16 [31]) struggle to model small objects, truncation, and occlusion and are
not invariant to these factors. Simply increasing the training data does not
solve these issues. On the other hand, properly designed synthetic training
data can help push forward the overall detection performance.

1.1 Related work 

Understanding convnets The tremendous success of convnets coupled with
their black-box nature has drawn much attention towards understanding them
better. Previous analyses have either focused on highlighting the versatility
of their features, learning equivariant mappings [19], training issues,
theoretical arguments for their expressive power [2], discussing the
brittleness of the decision boundary, visualising specific patterns emerging
inside the network [18, 31, 33, 22], or doing ablation studies of working
systems.
We leverage the recent Pascal3D+ annotations to do a new analysis
complementary to previous ones. Rather than aiming to explain how the network
works, we aim at identifying in which cases the network does not work well,
and whether training data is sufficient to improve these issues. While previous
work has shown that convnet representations are increasingly invariant with
depth, here we show that current architectures are still not overall invariant
to many appearance factors.

Synthetic data The idea of using rendered images to train detectors has been
visited multiple times. Some of the strategies considered include video game
renderings [41] (aiming at photo-realism), CAD model wire-frame renderings
[34, 25] (focusing on object boundaries), texture-mapped CAD models [29, 23],
or augmenting the training set by subtle deformations of the positive samples.

Most of these works focused on DPM-like detectors, which can only make limited
use of large training sets [43]. In this paper we investigate how different types
of renderings (wire-frame, materials, and textures) impact the performance of a
convnet. A priori convnets are more suitable to ingest larger volumes of data.

2 The R-CNN detector 

The remarkable convnet results in the ImageNet 2012 classification competition
ignited a new wave of neural networks for computer vision. R-CNN
adapts such convnets for the task of object detection, and has become the
de facto architecture for state-of-the-art object detection (with top results
on Pascal VOC [8] and ImageNet [6]) and is thus the focus of attention in this
paper. The R-CNN detector is a three-stage pipeline: object proposal
generation, convnet feature extraction, and one-vs-all SVM classification. We
refer to the original paper for details of the training procedure. Different
networks can be used for feature extraction (AlexNet [17], VGG, GoogleNet
[35]), all pre-trained on ImageNet and fine-tuned for detection. The larger
the network, the better the performance. The SVM gains a couple of final mAP
points compared to logistic regression used during fine-tuning (and larger
networks benefit less from it [11]).
In this work we primarily focus on the core ingredient: convnet fine-tuning 
for object detection. We consider fine-tuning with various training distributions, 
and analyse the performance under various appearance factors. Unless otherwise 
specified reported numbers include the SVM classification stage, but not the 
bounding box regression. 
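
The three-stage pipeline described above can be sketched in a few lines. This is only a minimal illustration, not the R-CNN implementation: `propose`, `extract`, and the per-class `(weights, bias)` pairs are hypothetical stand-ins for the proposal generator, the fine-tuned convnet, and the trained one-vs-all linear SVMs, and the box format is an assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), an assumed box format


def linear_score(weights: Sequence[float], bias: float,
                 features: Sequence[float]) -> float:
    """Score of one linear one-vs-all classifier on one feature vector."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def rcnn_detect(image,
                propose: Callable[[object], List[Box]],
                extract: Callable[[object, Box], Sequence[float]],
                svms: Dict[str, Tuple[Sequence[float], float]]):
    """Return (box, class, score) triples for every proposal and class."""
    detections = []
    for box in propose(image):                      # stage 1: proposals
        features = extract(image, box)              # stage 2: convnet features
        for cls, (weights, bias) in svms.items():   # stage 3: per-class SVM
            score = linear_score(weights, bias, features)
            detections.append((box, cls, score))
    return detections
```

In a real detector the scored boxes would additionally go through non-maximum suppression, which is omitted here.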


3 Pascal3D+ dataset 

Our experiments are enabled by the recently introduced Pascal3D+ [39] dataset.
It enriches PASCAL VOC 2012 with 3D annotations in the form of aligned 3D
CAD models for 11 classes (aeroplane, bicycle, boat, bus, car, chair, diningtable,
motorbike, sofa, train, and tv monitor) of the train and val subsets. The
alignments are obtained through human supervision, by first selecting the
visually most similar CAD model for each instance, and specifying the
correspondences between a set of 3D CAD model keypoints and their image
projections, which are used to compute the 3D pose of the instance in the
image. The rich object annotations include object pose and shape, and we use
them as a test bed for our analysis. Unless otherwise stated, all presented
models are trained on the Pascal3D+ train set and evaluated on its test set
(Pascal VOC 2012 val).

4 Synthetic images 

Convnets reach high classification/detection quality by using a large parametric
model (e.g. with millions of parameters). The price to pay is that convnets need
a large training set to reach top performance. We want to explore whether the
performance scales as we increase the amount of training data. To that end, we
explore two possible directions to increase the data volume: data augmentation
and synthetic data generation.

Data augmentation consists of creating new training samples by simple
transformations of the original ones (such as scaling, cropping, blurring,
subtle colour shifts, etc.), and it is a common practice during training of
large convnets. To generate synthetic images we rely on CAD models of the
object classes of interest. Rendering synthetic data has the advantage that we
can generate large amounts of training data in a controlled setup, allowing
for arbitrary appearance factor distributions. For our synthetic data
experiments we use an extended set of CAD models, and consider multiple types
of renderings (§4.1).

Figure 1: Example training samples for different types of synthetic rendering.
Pascal3D+ training set. Panels: (a) Real image, (b) Wire-frame, (c) Plain
texture, (d) Texture transfer.


Extended Pascal3D+ CAD models Although the Pascal3D+ dataset [39]
comes with its own set of CAD models, this set is rather small and it comes
without material information (only polygonal meshes). Thus the Pascal3D+
models alone are not sufficient for our analysis. We extend this set with
models collected from internet resources. We use an initial set of ~40 models
per class. For each Pascal3D+ training sample we generate one synthetic
version per model using a “plain texture” rendering (see next section) with
the same camera-to-object pose. We select suitable CAD models by evaluating
the R-CNN (trained on Pascal 2007 train set) on the rendered images, and we
keep a model if it generates the highest scoring response (across CAD models)
for at least one training sample. This procedure makes sure we only use CAD
models that generate somewhat realistic images close to the original training
data distribution, and makes it easy to prune unsuitable models. Out of ~440
initial models, ~275 models pass the selection process (~25 models per class).
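
The selection rule above (keep a CAD model if it gives the highest-scoring rendering for at least one training sample) can be sketched as follows. The nested dictionary layout and the identifiers are assumptions made for illustration; the detector scores themselves would come from running the R-CNN on the rendered images.

```python
from typing import Dict, Set


def select_cad_models(scores: Dict[str, Dict[str, float]]) -> Set[str]:
    """Keep every CAD model that wins (highest score) on some sample.

    scores[sample_id][model_id] is the detector score obtained on the
    rendering of `model_id` in the pose of training sample `sample_id`.
    """
    kept = set()
    for per_model in scores.values():
        if per_model:  # skip samples without any rendering
            kept.add(max(per_model, key=per_model.get))
    return kept


# Hypothetical scores for two training samples and three candidate models.
example_scores = {
    "sample_1": {"model_a": 0.9, "model_b": 0.1},
    "sample_2": {"model_a": 0.2, "model_b": 0.3, "model_c": 0.1},
}
selected = select_cad_models(example_scores)
```

Here `model_c` never produces the top response and would be pruned.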


4.1 Rendering types 

A priori it is unclear which type of rendering will be most effective to build or 
augment a convnet training set. We consider multiple options using the same set 
of CAD models. Note that all rendering strategies exploit the Pascal3D+ data 
to generate training samples with a distribution similar to the real data (similar 
size and orientation of the objects). See Fig. 1 for example renderings.

Wire-frame Using a white background, shape boundaries of a CAD model are 
rendered as black lines. This rendering reflects the shape (not the mesh) of the 
object, abstracting its texture or material properties and might help the detector 
to focus on the shape aspects of the object. 

Plain texture A somewhat more photo-realistic rendering considers the
material properties (but not the textures), so that shadows are present. We
considered using a blank background, or an environment model to generate
plausible backgrounds. We obtain slightly improved results using the plausible
backgrounds, and thus only report these results. This rendering provides
“toy car” type images, which can be considered a middle ground between
“wire-frame” and “texture transfer” rendering.

Texture transfer All datasets suffer from bias, and it is hard to identify
it by hand. Ideally, synthetic renderings should have the same bias as the real
data, while injecting additional diversity. We aim at solving this by generating
new training samples via texture transfer. For a given annotated object in the
Pascal3D+ dataset, we have both the image it belongs to and an aligned 3D
CAD model. We create a new training image by replacing the object with a new
3D CAD model, and by applying over it a texture coming from a different image.
This approach allows us to generate objects with slightly different shapes, and
with different textures, while still adequately positioned in a realistic
background context (for now, our texture transfer approach ignores occlusions).
This type of rendering is close to photo-realistic, using real background
context, while increasing the diversity by injecting new object shapes and
textures.

As we will see in §7, it turns out that any of our renderings can be used to
improve detection performance. However, the degree of realism affects how much
improvement is obtained.


5 What did the network learn from real data? 

In this section we analyze the R-CNN's detection performance in an attempt to
understand what the models have actually learned. We first explore the models'
performance across different appearance factors (§5.1), going beyond the usual
per-class detection performance. Second, we dive deeper and aim at
understanding what the network layers have actually learned (§5.2).

5.1 Detection performance across appearance factors 


To analyze the performance across appearance factors we split each factor into
equi-spaced bins. We present a new evaluation protocol where for each bin only
the data falling in it are actually considered in the evaluation and the rest are
ignored. This allows us to dissect the detection performance across different
aspects of an appearance factor. The original R-CNN work [12] includes a
similar analysis based on the toolkit from [15]. Pascal3D+ however enables a
more fine-grained analysis. Our experiments report results for AlexNet (51.2
mAP) [17], GoogleNet (56.6 mAP) [35], VGG16 (58.8 mAP) [32] and their
combination (62.4 mAP).
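
The per-bin protocol can be illustrated with a small sketch: each annotated object is assigned to one equi-spaced bin of a factor, and a bin is then evaluated using only its own objects. The function names, the sample layout, and the choice of 8 azimuth bins are assumptions for illustration; the mAP computation itself is not shown.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def bin_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Index of the equi-spaced bin over [lo, hi] that contains `value`."""
    if not lo <= value <= hi:
        raise ValueError("value outside the factor range")
    width = (hi - lo) / n_bins
    # The upper boundary `hi` is assigned to the last bin.
    return min(int((value - lo) / width), n_bins - 1)


def split_into_bins(samples: Sequence[T], factor: Callable[[T], float],
                    lo: float, hi: float, n_bins: int) -> List[List[T]]:
    """Group samples by factor bin; each group is evaluated on its own."""
    bins: List[List[T]] = [[] for _ in range(n_bins)]
    for sample in samples:
        bins[bin_index(factor(sample), lo, hi, n_bins)].append(sample)
    return bins


# Hypothetical annotations with an azimuth angle in degrees.
objects = [{"azimuth": 10.0}, {"azimuth": 50.0}, {"azimuth": 359.0}]
azimuth_bins = split_into_bins(objects, lambda o: o["azimuth"], 0.0, 360.0, 8)
```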

Appearance factors We focus the evaluation on the following appearance
factors: rotation (azimuth, elevation), size, occlusion and truncation, as these
factors have a strong impact on object appearance. Azimuth and elevation refer
to the angular camera position w.r.t. the object. Size refers to the bounding
box height. Although the Pascal3D+ dataset comes with binary occlusion and
truncation states, using the aligned CAD models and segmentation masks we
compute the level of occlusion as well as the level and type of truncation.
While occlusion and truncation levels are expressed as object area percentage,
we distinguish between 4 truncation types: bottom (b), top (t), left (l) and
right (r) truncation.
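
One possible way to turn an aligned CAD projection and a segmentation mask into area percentages is sketched below. The text does not spell out the exact computation, so the definitions here are assumptions: truncation is taken as the share of projected model pixels outside the image, and occlusion as the share of in-image model pixels missing from the visible mask. Pixel sets are a hypothetical representation chosen for brevity.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]  # (x, y)


def _inside(pixels: Iterable[Pixel], width: int, height: int) -> Set[Pixel]:
    """Pixels that fall within an image of the given size."""
    return {(x, y) for x, y in pixels if 0 <= x < width and 0 <= y < height}


def truncation_level(model_pixels: Set[Pixel], width: int, height: int) -> float:
    """Percentage of the projected model area lying outside the image."""
    if not model_pixels:
        raise ValueError("empty model projection")
    outside = len(model_pixels) - len(_inside(model_pixels, width, height))
    return 100.0 * outside / len(model_pixels)


def occlusion_level(model_pixels: Set[Pixel], visible_pixels: Set[Pixel],
                    width: int, height: int) -> float:
    """Percentage of the in-image model area not visible in the mask."""
    inside = _inside(model_pixels, width, height)
    if not inside:
        return 0.0
    return 100.0 * len(inside - visible_pixels) / len(inside)


# Toy example: a 10x10 projection, half of it left of the image border.
model = {(x, y) for x in range(-5, 5) for y in range(10)}
visible = {(x, y) for x in range(0, 5) for y in range(5)}
trunc = truncation_level(model, 20, 20)
occl = occlusion_level(model, visible, 20, 20)
```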

Analysis Fig. 2 reports performance across the factors. The results point to
multiple general observations. First, there is a clear ordering among the
models. VGG16 is better than GoogleNet on all factor bins, which in turn
consistently outperforms AlexNet. The combination of the three models (SVM
trained on concatenated features) consistently outperforms all of them,
suggesting there is underlying complementarity among the networks. Second, the
relative strengths and weaknesses across the factors remain the same across
models. All networks struggle with occlusions, truncations, and objects below
120 pixels in height. Third, for each factor the performance is not
homogeneous across bins, suggesting the networks are not invariant w.r.t. the
appearance factors.

It should be noted that there are a few confounding factors in the results.
The first such factor is the image support (pixel area) of the object, which
is strongly correlated with performance. Whenever the support is smaller
(e.g. small sizes, large occlusions/truncations or frontal views) the
performance is lower. The second confounding factor is the training data
distribution. A network with a finite number of parameters needs to decide to
which cases it will allocate resources. The loss used during training will
push the network to handle well the most common cases, and disregard the rare
cases. A typical example is the elevation, where the models learn to handle
well the near 0° cases (most represented), while they all fail on the
outliers: upper (90°) and lower (-90°) cases. We explore precisely this aspect
in Section 6, where we investigate performance under different training
distributions.

Conclusion There is a clear performance ordering among the convnets, which all
have similar weaknesses, tightly related to data distribution and object area.
Occlusion, truncation, and small objects are clearly weak points of the R-CNN
detectors (arguably harder problems by themselves). Given the similar
tendencies, the next sections focus on AlexNet.


5.2 Appearance vector disentanglement 

Figure 2: mAP of R-CNN over appearance factors (recoverable panel labels:
elevation, truncationType). Pascal3D+ dataset.

Figure 3: Average cluster entropy versus number of clusters K at different
layers, for different appearance factors: (a) Class, (b) Azimuth, (c)
Elevation, (d) Shape. Pascal3D+ test data.

Other than just the raw detection quality, we are interested in understanding
what the network learned internally. While previous work focused on specific
neuron activations, we aim at analyzing the feature representations of
individual layers. Given a trained network, we apply it over positive test
samples, and cluster the feature vectors at a given layer. We then inspect the
cluster entropy with respect to different appearance factors, as we increase
the number of clusters. The resulting curves are shown in Fig. 3. Lower
average entropy indicates that at the given layer the network is able to
disentangle the considered appearance factor. Disentanglement relates to
discriminative power, invariance, and equivariance. (A related entropy-based
metric is reported in [1]; however, it focuses on individual neurons.)
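
The average cluster entropy can be sketched as follows, assuming the clustering has already been run and each sample carries a discrete label for the factor of interest (e.g. class, or a quantised azimuth bin). Taking the unweighted mean over clusters and using base-2 logarithms are assumptions; the text does not specify either choice.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def average_cluster_entropy(cluster_ids: Sequence[Hashable],
                            labels: Sequence[Hashable]) -> float:
    """Mean over clusters of the entropy (in bits) of the labels inside."""
    if len(cluster_ids) != len(labels) or not labels:
        raise ValueError("need one label per clustered sample")
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members[cluster].append(label)
    total = 0.0
    for cluster_labels in members.values():
        n = len(cluster_labels)
        total += -sum((count / n) * math.log2(count / n)
                      for count in Counter(cluster_labels).values())
    return total / len(members)


# Two pure clusters: the factor is perfectly disentangled.
pure = average_cluster_entropy([0, 0, 1, 1], ["car", "car", "bus", "bus"])
# One cluster mixing two labels evenly: one bit of entropy.
mixed = average_cluster_entropy([0, 0, 0, 0], ["car", "bus", "car", "bus"])
```

A curve like those in Fig. 3 would be obtained by repeating this for an increasing number of clusters at each layer.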

Analysis From Fig. 3 we see that classes are well disentangled. As we go from
the lowest conv1 layer to the highest fc7 layer the disentanglement increases,
showing that with depth the network layers become more variant w.r.t. category.
This is not surprising as the network has been trained to distinguish classes. On
the other hand, for azimuth, elevation and shape (class-specific disentanglement)
the disentanglement across layers and across cluster number stays relatively
constant, pointing out that the layers are not as variant to these factors. We
also applied this evaluation over plain texture renderings (see §4.1 and §7) to
evaluate the disentanglement of CAD models; the result is quite similar to
Fig. 3.

Conclusion We make two observations. First, convnet representations at higher
layers disentangle object categories well, which explains their strong
recognition performance. Second, network layers are to some extent invariant to
different factors.


6 What could the network learn with more data? 


Section 5 inspected what the network learned when trained with the original
training set. In this section we explore what the network could learn if
additional data is available. We will focus on size (§6.1), truncation (§6.2)
and occlusion (§6.3) cases since these are aspects that R-CNNs struggle to
handle. For each case we consider two general approaches: changing the training
data distribution, or using additional supervision during training. For the
former we use data augmentation to generate additional samples for specific
size, occlusion, or truncation bins. Augmenting the training data distribution
helps us realize if adding extra training data for a specific factor bin helps
improve the performance on that particular bin. When using additional
supervision, we leverage the annotations to train a separate model for each
bin. Providing an explicit signal during training forces the network to
distinguish among specific factor bins. The experiments involve fine-tuning the
R-CNN only (no SVM on top) as we are interested in convnet modelling
capabilities.

Figure 4: Training with varying object size distribution (plot title:
Detection performance vs object size).


6.1 Size handling

Fig. 4 shows the results with different object size training distributions.

More data The “original” bars correspond to the results in Fig. 2. “Up &
downscale” corresponds to training with a uniform size distribution across
bins by up/down-scaling all training samples to all bins. As upscaled images
will be blurry, “downscale only” avoids such blur, resulting in a distribution
with more small size training samples than larger sizes. Results in Fig. 4
indicate that data augmentation across sizes can provide a couple of mAP
points gain for small objects, however the network still struggles with small
objects, thus it is not invariant w.r.t. size despite the uniform training
distribution.
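
The two size augmentation schemes can be sketched as computing, for each training sample, the scale factors that map its height to every size bin. The bin centres used below are hypothetical values picked for illustration, and the actual image resampling is left out.

```python
from typing import List, Sequence


def rescale_factors(height: float, bin_centres: Sequence[float],
                    downscale_only: bool = False) -> List[float]:
    """Scale factors mapping an object of `height` pixels to each size bin.

    With `downscale_only`, factors above 1 are dropped so that no blurry
    upscaled samples are generated.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    factors = []
    for centre in bin_centres:
        scale = centre / height
        if downscale_only and scale > 1.0:
            continue
        factors.append(scale)
    return factors


# Hypothetical size bins (object height in pixels).
centres = [50.0, 100.0, 200.0, 400.0]
up_and_down = rescale_factors(200.0, centres)                  # all bins
down_only = rescale_factors(200.0, centres, downscale_only=True)
```

With "downscale only", large samples contribute to every smaller bin while small samples contribute nowhere else, which is why that scheme ends up with more small-size training samples than large ones.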

Bin-specific models The right-side bars of Fig. 4 show results for
bin-specific networks. Each bar corresponds to a model trained and tested on
that size range. Both augmentation methods outperform the original data
distribution on all size bins (e.g. at 195 pixels, “up & downscale” improves
by 5.2 mAP). In “comb size” we combine the “up & downscale” size-specific
models via an SVM trained on their concatenated features. This results in
superior overall performance (54.0 mAP) w.r.t. the original data (51.2 mAP
with SVM).

Conclusion These results indicate that a) adding data uniformly across sizes
provides mild gains for small objects and does not result in size-invariant
models, suggesting that the models suffer from limited capacity, and b)
training bin-specific models results in better per-bin and overall performance.



Figure 5: Training with varying truncated objects distribution (plot title:
Detection precision versus truncation).

Figure 6: Training with varying occluded objects distribution (plot title:
Detection precision versus occlusion level; x-axis: Occlusion level).


6.2 Truncation handling 

More data Fig. 5 shows that generating truncated samples from non-truncated
ones, respecting the original data distribution, helps improve (by 1.5 mAP
points) the handling of objects with minimal truncation, but does not improve
medium or large truncation handling (the trend for top, left and right is
similar to the shown bottom case). Using biased distributions provided worse
results.
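
Generating a truncated sample from a non-truncated one can be sketched as cutting away part of the object's bounding box, here for the bottom case. This is an assumed, simplified version of the augmentation: the box format is hypothetical, and the corresponding image crop (so that the image border actually cuts the object) is not shown.

```python
from typing import Tuple

FloatBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y grows downwards


def truncate_bottom(box: FloatBox, fraction: float) -> FloatBox:
    """Remove the bottom `fraction` of a box's height."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    x1, y1, x2, y2 = box
    new_y2 = y2 - fraction * (y2 - y1)
    return (x1, y1, x2, new_y2)


# A 100x200 object truncated by 25% at the bottom.
truncated = truncate_bottom((0.0, 0.0, 100.0, 200.0), 0.25)
```

The top, left and right cases follow by moving the corresponding box edge instead.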

Bin-specific models Similar to the “more data” case, training a convnet for
each specific truncation case only helps for the low truncation cases, but is
ineffective for medium or large truncations.

Conclusion These results are a clear indication that additional training data
does not by itself help handle this case. Architectural changes to the detector
seem required to obtain a meaningful improvement.


6.3 Occlusion handling 


Similar to the truncation case, Fig. 6 shows that specialising a network for each
occlusion case is only effective for the low occlusion case. Medium/high occlusion
cases are a “distraction” for training non-occluded object detection.


Conclusion For truncation and occlusion, it seems that architectural changes 
are needed to obtain significant improvements. Simply adding training data or 
focusing the network on sub-tasks seems insufficient. 

7 Does synthetic data help?

We have seen that convnets have weak spots for object detection, and adding
data results in limited gains. As convnets are data-hungry methods, the
question remains what happens when more data from the same training
distribution is introduced. Obtaining additional annotated training data is
expensive, thus we consider the option of using renderings. Our results with
renderings (see §4) are summarised in Tab. 1. Again we focus on fine-tuning
convnets only. All renderings are done using a similar data distribution as
the original one, aiming to improve on common cases.

Synthetic type     Ratio Real:Synth.   mAP
-                  1:0                 47.6
Wire-frame         0:1                 21.8
Plain texture      0:1                 23.5
Texture transfer   0:1                 38.4
Wire-frame         1:2                 48.3
Plain texture      1:2                 49.9
Texture transfer   1:2                 51.5

Table 1: Results with different synthetic data types. Pascal3D+ test.
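
As a small worked check of Tab. 1, the gains of the mixed (1:2) settings over the real-data baseline can be computed directly from the table values; the variable names are introduced here for illustration only.

```python
# mAP values copied from Tab. 1.
baseline_map = 47.6  # real data only (ratio 1:0)
mixed_map = {
    "Wire-frame": 48.3,
    "Plain texture": 49.9,
    "Texture transfer": 51.5,
}

# Gain in mAP points of each 1:2 mixture over the real-only baseline.
gains = {name: round(value - baseline_map, 1)
         for name, value in mixed_map.items()}
```

The resulting gains are 0.7, 2.3 and 3.9 mAP points; the latter two are what the text rounds to roughly 2 and 4 points.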


Analysis From Tab. 1 we observe that using synthetic data alone (0:1 ratio)
under-performs compared to using real data, showing there is still room for
improvement on the synthetic data itself. That being said, we observe that
even the arguably weak wire-frame renderings do help improve detections when
used as an extension of the real data. We empirically chose a data ratio of
1:2 between real and synthetic as that seemed to strike a good balance between
the two data sources. As expected, the detection improvement grows with the
photo-realism (see Tab. 1). This indicates that further gains can be expected
as photo-realism is improved. Our texture transfer approach is reasonably
effective, with a 4 mAP points improvement. Wire-frame renderings inject
information from the extended CAD models. The plain texture renderings
additionally inject information from the material properties and the
background images. The texture transfer renderings use Pascal3D+ data, which
include some ImageNet images too. If we add these images directly to the
training set (instead of doing the texture transfer) we obtain 50.6 mAP
(original to additional ImageNet images ratio is 1:3). This shows that the
increased diversity of our synthetic samples further helps improve results.
Plain textures provide 2 mAP points improvement, and texture transfer 4 mAP
points. In comparison, a 3 mAP points gain has been reported (on the Pascal
VOC 2012 test set) when using the Pascal VOC 2007 together with the 2012
training data (over an R-CNN variant). Our gains are quite comparable to such
a number, despite relying on synthetic renderings.
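
Building a training set with the 1:2 real-to-synthetic ratio can be sketched as below. The sampling strategy (uniform sampling without replacement, fixed seed) is an assumption; the text only states the ratio.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_set(real: Sequence[T], synthetic: Sequence[T],
                     synth_per_real: int = 2, seed: int = 0) -> List[T]:
    """Combine all real samples with `synth_per_real` times as many renderings."""
    n_synth = synth_per_real * len(real)
    if n_synth > len(synthetic):
        raise ValueError("not enough synthetic samples for the requested ratio")
    rng = random.Random(seed)
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)  # avoid presenting all real samples first
    return mixed


# Hypothetical sample identifiers.
real_ids = ["real_%d" % i for i in range(3)]
synth_ids = ["synth_%d" % i for i in range(10)]
mixed_ids = mix_training_set(real_ids, synth_ids)
```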


Conclusion Synthetic renderings are an effective means to increase the overall
detection quality. Even simple wire-frame renderings can be of help.

8 All-in-one 


In Tab. 2 we show results when training the SVM on top of the concatenated
features of the convnets fine-tuned with real and mixed data. We also report
joint object localization and viewpoint estimation results (AAVP measure). As
in [24], for viewpoint prediction we rely on a regressor trained on convnet
features fine-tuned for detection.

We observe that the texture renderings improve performance on all models
(e.g. VGG16 58.8 to 61.9 mAP). Combining these three models further improves
the detection performance and achieves state-of-the-art viewpoint estimation.

Data                          CNN            mAP    AAVP
Pascal3D+                     AlexNet        51.2   35.3 [24]
                              GoogleNet      56.6
                              VGG16          58.8
                              comb           62.6
Pascal3D+ & Texture transfer  AlexNet        54.6   -
                              GoogleNet      59.1   -
                              VGG16          61.9   -
                              comb           64.1   43.8
                              comb+size      64.7   -
                              comb+bb        66.3   -
                              comb+size+bb   67.2   -

Table 2: Pascal3D+ results


Adding size-specific VGG16 models (as in §6.1) further pushes the results,
improving (up to 5 mAP points) on small/medium sized objects. Adding bounding
box regression, our final combination achieves 67.2 mAP, the best reported
detection result on Pascal3D+.
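
The "comb" entries rely on concatenating, per proposal, the features of several fine-tuned networks before training the SVM. A minimal sketch of that concatenation step is given below, using plain lists; the feature values are made up, and the SVM training on top is not shown.

```python
from typing import List, Sequence


def concatenate_features(
        per_model: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Concatenate per-proposal feature vectors from several networks.

    per_model[m][i] is the feature vector of proposal i under network m.
    """
    if not per_model:
        raise ValueError("need features from at least one model")
    n_proposals = len(per_model[0])
    if any(len(feats) != n_proposals for feats in per_model):
        raise ValueError("all models must describe the same proposals")
    combined = []
    for i in range(n_proposals):
        row: List[float] = []
        for feats in per_model:
            row.extend(feats[i])
        combined.append(row)
    return combined


# Two proposals described by two hypothetical networks.
net_a = [[1.0, 2.0], [3.0, 4.0]]
net_b = [[5.0], [6.0]]
joint = concatenate_features([net_a, net_b])
```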


9 Conclusion 


We presented new results regarding the performance and potential of the R-CNN
architecture. Although higher overall performance can be reached with
deeper convnets (VGG16), all the considered state-of-the-art networks have
similar weaknesses; they underperform for truncated, occluded and small objects
(§5). Additional training data does not solve these weak points, hinting that
structural changes are needed. Despite common belief, our results suggest these
models are not invariant to various appearance factors. Increased training data,
however, does improve overall performance, even when using synthetic image
renderings (§7).

In future work, we would like to extend the CAD model set in order to cover
more categories. Understanding which architectural changes will be most
effective to handle truncation, occlusion, or small objects remains an open
question.